Linear regression is a very elegant, simple, powerful and commonly used technique for data analysis. We use it extensively in exploratory data analysis (we used in project 2, for example) and in statistical analyses since it fits into the statistical framework we saw in the last unit, and thus let’s do things like construct confidence intervals and hypothesis testing for relationships between variables.

Simple Regression

Let’s start with the simplest linear model. The goal here is to analyze the relationship between a continuous numerical variable \(Y\) and another (numerical or categorical) variable \(X\). We assume that in our population of interest the relationship between the two is given by a linear function:

\[ Y = \beta_0 + \beta_1 X \]

Here is (simulated) data from an advertising campaign measuring sales and the amount spent in advertising. We think that sales are related to the amount of money spent on TV advertising:

\[ \mathtt{sales} \approx \beta_0 + \beta_1 \times \mathtt{TV} \]

Given this data, we would say that we regress sales on TV when we perform this regression analysis. As before, given data we would like to estimate what this relationship is in the population (what is the population in this case?). What do we need to estimate in this case? Values for \(\beta_0\) and \(\beta_1\). What is the criteria that we use to estimate them?

Just like the previous unit we need to setup an inverse problem. What we are stating mathematically in the linear regression problem is that the conditional expectation (or conditional mean, conditional average) of \(Y\) given \(X=x\) is defined by this linear relationship:

\[ \mathbb{E}[Y|X=x] = \beta_0 + \beta_1 x \]

Given a dataset, the inverse problem is then to find the values of \(\beta_0\) and \(\beta_1\) that minimize deviation between data and expectation, and again use squared devation to do this.

The linear regression problem

Given data \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\), find values \(\beta_0\) and \(\beta_1\) that minimize objective or loss function RSS (residual sum of squares):

\[ \arg \min_{\beta_0,\beta_1} RSS = \frac{1}{2} \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2 \]

Similar to what we did with the derivation of the mean as a measure of central tendency we can derive the values of minimizers\(\hat{\beta}_0\) and \(\hat{\beta}_1\). We use the same principle, compute derivatives (partial this time) of the objective function RSS, set to zero and solve to obtain:

\[ \begin{align} \hat{\beta}_1 & = \frac{\sum_{i=1}^n (y_i - \overline{y})(x_i - \overline{x})}{\sum_{i=1}^n (x_i - \overline{x})^2} \\ {} & = \frac{\mathrm{cov}(y,x)}{\mathrm{var}(x)} \\ \hat{\beta}_0 & = \overline{y} - \hat{\beta}_1 \overline{x} \end{align} \]

Let’s take a look at some data. Here is data measuring characteristics of cars, including horsepower, weight, displacement, miles per gallon. Let’s see how well a linear model captures the relationship between miles per gallon and weight

library(ISLR)
library(dplyr)
library(ggplot2)
library(broom)

data(Auto)

Auto %>%
  ggplot(aes(x=weight, y=mpg)) +
    geom_point() + 
    geom_smooth(method=lm) + 
    theme_minimal()

In R, linear models are built using the lm function

auto_fit <- lm(mpg~weight, data=Auto)
auto_fit
## 
## Call:
## lm(formula = mpg ~ weight, data = Auto)
## 
## Coefficients:
## (Intercept)       weight  
##   46.216525    -0.007647

This states that for this dataset \(\hat{\beta}_0 = 46.2165245\) and \(\hat{\beta}_1 = -0.0076473\). What’s the interpretation? According to this model, a weightless car weight=0 would run \(\approx 46.22\) miles per gallon on average, and, on average, a car would run \(\approx 0.01\) miles per gallon fewer for every extra pound of weight. Note, that the units of the outcome \(Y\) and the predictor \(X\) matter for the interpretation of these values.

Inference

As we saw in the last unit, now that we have an estimate, we want to know how good of an estimate this is. We will see that similar arguments based on the CLT hold again. The main point is to understand that like the sample mean, the regression line we learn from a specific dataset is an estimate. A different sample from the same population would give us a different estimate (regression line). But, the CLT tells us that, on average, we are close to population regression line (I.e., close to \(\beta_0\) and \(\beta_1\)), that the spread around \(\beta_0\) and \(\beta_1\) is well approximated by a normal distribution and that the spread goes to zero as the sample size increases.

Confidence Interval

Using the same framework as before, we can construct a confidence interval to say how precise we think our estimates of the population regression line is. In particular, we want to see how precise our estimate of \(\beta_1\) is, since that captures the relationship between the two variables. We again, use a similar framework. First, we calculate a standard error estimate for \(\beta_1\):

\[ \mathrm{se}(\hat{beta}_1)^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (x_i - \overline{x})^2} \]

and construct a 95% confidence interval

\[ \beta_1 = \hat{\beta}_1 \pm 1.95 \times \mathrm{se}(\hat{beta}_1) \]

Note, \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\). Going back to our example:

auto_fit_stats <- auto_fit %>%
  tidy() %>%
  select(term, estimate, std.error)
auto_fit_stats
##          term     estimate    std.error
## 1 (Intercept) 46.216524549 0.7986724633
## 2      weight -0.007647343 0.0002579633

This tidy function is defined by the broom package, which is very handy to manipulate the result of learning models in a consistent manner. The select call removes some extra information that we will discuss shortly.

confidence_interval_offset <- 1.95 * auto_fit_stats$std.error[2]
confidence_interval <- round(c(auto_fit_stats$estimate[2] - confidence_interval_offset,
                               auto_fit_stats$estimate[2],
                               auto_fit_stats$estimate[2] + confidence_interval_offset), 4)

Given the confidence interval, we would say, “on average, a car runs \(_{-0.0082} -0.0076_{-0.0071}\) miles per gallon fewer per pound of weight.

The \(t\)-statistic and the \(t\)-distribution

As in the previous unit, we can also test a null hypothesis about this relationship: “there is no relationship between weight and miles per gallon”, which translates to \(\beta_1=0\). Again, using the same argument based on the CLT, if this hypothesis is true then the distribution of \(\hat{\beta}_1\) is well approximated by \(N(0,\mathrm{se}(\hat{\beta}_1))\), and if we observe the learned \(\hat{\beta}_1\) is too far from 0 according to this distribution then we reject the hypothesis.

Now, there is a technicality here that we did not discuss in the previous unit that is worth paying attention to. We saw before that the CLT states that the normal approximation is good as sample size increases, but what about moderate sample sizes (say, less than 100)? The \(t\) distribution provides a better approximation of the sampling distribution of these estimates for moderate sample sizes, and it tends to the normal distribution as sample size increases.

The \(t\) distribution is commonly used in this testing situation to obtain the probability of rejecting the null hypothesis. It is based on the \(t\)-statistic

\[ \frac{\hat{\beta}_1}{\mathrm{se}(\hat{\beta}_1)} \]

You can think of this as a signal-to-noise ratio, or a standardizing transformation on the estimated parameter. Under the null hypothesis, it was shown that the \(t\)-statistic is well approximated by a \(t\)-distribution with \(n-2\) degrees of freedom (we will get back to degrees of freedom shortly). Like other distributions, you can compute with the \(t\)-distribution using the p,d,q,r-family of functions, e.g., pt is the cumulative probability distribution function.

In our example, we get a \(t\) statistic and P-value as follows:

auto_fit_stats <- auto_fit %>%
  tidy()
auto_fit_stats
##          term     estimate    std.error statistic       p.value
## 1 (Intercept) 46.216524549 0.7986724633  57.86668 1.623069e-193
## 2      weight -0.007647343 0.0002579633 -29.64508 6.015296e-102

We would say: “We found a statistically significant relationship between weight and miles per gallon. On average, a car runs \(_{-0.0082} -0.0076_{-0.0071}\) miles per gallon fewer per pound of weight (\(t\)=-29.65, \(p\)-value<6.015296110^{-102}$).”

Global Fit

Now, notice that we can make predictions based on our conditional expectation, and that prediction should be better than a prediction with a simple average. We can use this comparison as a measure of how good of a job we are doing using our model to fit this data: how much of the variance of \(Y\) can we explain with our model. To do this we can calculate total sum of squares:

\[ TSS = \sum_i (y_i - \overline{y})^2 \]

(this is the squared error of a prediction using the sample mean of \(Y\))

and the residual sum of squares:

\[ RSS = \sum_i (y_i - \hat{y}_i)^2 \]

(which is the squared error of a prediction using the linear model we learned)

The commonly used \(R^2\) measure comparse these two quantities:

\[ R^2 = \frac{\mathrm{TSS}-\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \]

These types of global statistics for the linear model can be obtained using the glance function in the broom package. In our example

auto_fit %>%
  glance() %>%
  select(r.squared, sigma, statistic, df, p.value)
##   r.squared    sigma statistic df       p.value
## 1 0.6926304 4.332712  878.8309  2 6.015296e-102

We will explain the the columns statistic, df and p.value when we discuss regression using more than a single predictor \(X\).

Some important technicalities

We mentioned above that predictor \(X\) could be numeric or categorical. However, this is not precisely true. We can use a transformation to represent categorical variables. Here is a simple example:

Suppose we have a categorical variable sex with values female and male, and we want to show the relationship between, say credit card balance and sex. We can create a dummy variable \(x\) as follows:

\[ x_i = \left\{ \begin{align} 1 & \textrm{ if female} \\ 0 & \textrm{o.w.} \end{align} \right. \]

and fit a model \(y = \beta_0 + \beta_1 x\). What is the conditional expectation given by this model? If the person is male, then \(y=\beta_0\), if the person is female, then \(y=\beta_0 + \beta_1\). So, what is the interpretation of \(\beta_1\)? The average difference in credit card balance between females and males.

We could do a different encoding:

\[ x_i = \left\{ \begin{align} +1 & \textrm{ if female} \\ -1 & \textrm{o.w.} \end{align} \right. \]

Then what is the interpretation of \(\beta_1\) in this case?

Note, that when we call the lm(y~x) function and x is a factor with two levels, the first transformation is used by default. What if there are more than 2 levels? We need multiple regression, which we will see shortly.

Issues with linear regression

There are some assumptions underlying the inferences and predictions we make using linear regression that we should verify are met when we use this framework. Let’s start with four important ones that apply to simple regression

Non-linearity of outcome-predictor relationship

What if the underlying relationship is not linear? We will see later that we can capture non-linear relationships between variables, but for now, let’s concentrate on detecting if a linear relationship is a good approximation. We can use exploratory visual analysis to do this for now by plotting residuals \((y_i - \hat{y}_i)^2\) as a function of the fitted values \(\hat{y}_i\).

The broom package uses the augment function to help with this task. It augments the input data used to learn the linear model with information of the fitted model for each observation

augmented_auto <- auto_fit %>%
  augment()
augmented_auto %>% head()
##   .rownames mpg weight  .fitted   .se.fit    .resid        .hat   .sigma
## 1         1  18   3504 19.42024 0.2575448 -1.420236 0.003533343 4.337678
## 2         2  15   3693 17.97489 0.2862653 -2.974889 0.004365337 4.335643
## 3         3  18   3436 19.94026 0.2487426 -1.940256 0.003295950 4.337158
## 4         4  16   3433 19.96320 0.2483756 -3.963198 0.003286232 4.333606
## 5         5  17   3449 19.84084 0.2503543 -2.840840 0.003338799 4.335878
## 6         6  15   4341 13.01941 0.4142337  1.980589 0.009140525 4.337104
##        .cooksd .std.resid
## 1 0.0001911753 -0.3283745
## 2 0.0010380292 -0.6881147
## 3 0.0003326721 -0.4485553
## 4 0.0013838821 -0.9162219
## 5 0.0007225022 -0.6567698
## 6 0.0009727164  0.4592282

With that we can make the plot we need to check for possible non-linearity

augmented_auto %>%
  ggplot(aes(x=.fitted,y=.resid)) +
    geom_point() + 
    geom_smooth() +
    labs(x="fitted", y="residual")

Correlated Error

For our inferences to be valid, we need residuals to be independent and identically distributed. We can spot non independence if we observe a trend in residuals as a function of the predictor \(X\). Here is a simulation to demonstrate this:

In this case, our standard error estimates would be underestimated and our confidence intervals and hypothesis testing results would be biased.

Non-constant variance

Another violation of the iid assumption would be observed if the spread of residuals is not independent of the fitted values. Here is an illustration, and a possible fix using a log transformation on the outcome \(Y\).